ABSTRACT

Sharing real-time aggregate statistics of private data has
given much benefit to the public to perform data mining
for understanding important phenomena, such as Influenza
outbreaks and traffic congestion. We propose an adaptive
approach with sampling and estimation to release aggre- 
gated time series under differential privacy, the key inno- 
vation of which is that we utilize feedback loops based on 
observed (perturbed) values to dynamically adjust the es- 
timation model as well as the sampling rate. To minimize 
the overall privacy cost, our solution uses the PID controller 
to adaptively sample long time-series according to detected 
data dynamics. To improve the accuracy of data release per 
timestamp, the Kalman filter is used to predict data values 
at non-sampling points and to estimate true values from per- 
turbed query answers at sampling points. Our experiments 
with three real data sets show that it is beneficial to in- 
corporate feedback into both the estimation model and the 
sampling process. The results confirmed that our adaptive 
approach improves accuracy of time-series release and has 
excellent performance even under very small privacy cost. 

1. INTRODUCTION 

Sharing real-time aggregate statistics of private data has 
given much benefit to the public to perform data mining for 
understanding important phenomena. Consider the follow- 
ing examples of data aggregation and mining applications: 

Disease Surveillance A health care provider, such 
as an Emergency Department, gathers data from indi- 
vidual visitors. The collected data, e.g. daily number 
of Influenza cases, is then shared with a third party, for 
instance, researchers, in order to monitor and to detect 
possible seasonal epidemic outbreaks at the earliest. 

Traffic Monitoring A GPS service provider gathers data from a set of individual users about their locations, speeds, mobility, etc. The aggregated data, for instance, the number of users at each region during each time period, can be mined for commercial interest, such as popular places, as well as public interests, such as congestion patterns in roads.

Figure 1: Aggregate data sharing scenario

In general, such aggregate data sharing applications follow a similar scenario, as shown in Figure 1. In this scenario, a central trusted component gathers data from a large number of individual subscribers. The collected data may then be aggregated and continuously shared with other untrusted entities for various purposes. The trusted server, i.e. the publisher, is assumed to be bound by contractual obligations to protect the users' interests; therefore, it must ensure that releasing the data does not compromise the privacy of any individual who contributed data. The goal of our work is to
enable the publisher to share useful aggregate statistics over 
individual users continuously (aggregate time series) while 
guaranteeing their privacy. 

The current state-of-the-art paradigm for privacy-preserving 
data publishing is differential privacy. Differential privacy 
requires that the aggregate statistics reported by a data pub- 
lisher be perturbed by a randomized algorithm A, so that 
the output of A remains roughly the same even if any single 
tuple in the input data is arbitrarily modified. This ensures 
that given the output of A, an adversary will not be able 
to infer much about any single tuple in the input, and thus 
privacy is protected. 

Figure 2: Adaptive sampling with Traffic data

Most existing work on differentially private data release deals with one-time release of static data [sj [oj [iT] [23] [25] [26]. In the applications we consider, high-volume data are acquired dynamically. In the aggregate time series, the data values at successive timestamps can be highly correlated, which makes solutions designed for static data problematic.
A standard differential privacy mechanism can be applied to perturb the value at each timestamp. Due to the correlation between the values and the composition theorems of differential privacy, it can lead to an overall perturbation error of $\Theta(T)$, where T is the length of the time series, which severely limits the utility of the published data if T is very large.

A few recent works [5, 9, 23] have studied the problem of releasing time series or continual statistics. Rastogi and Nath [23] proposed an algorithm which perturbs the Discrete Fourier Transform (DFT) of the entire time series and reconstructs a released version from the Inverse DFT. Since the entire time series is required to perform those operations, the timeliness of publishing is greatly impacted, limiting its applicability for real-time disease surveillance and traffic monitoring applications. Moreover, this solution also suffers from reconstruction error when calculating the Inverse DFT to recover the original time series.

Dwork et al. [9] proposed a differentially private continual counter over a binary stream with a bounded error at each time step k of $O(\frac{1}{\alpha}(\log k)^{1.5})$, where $\alpha$ is the degree of differential privacy provided. Chan et al. [5] studied the same problem and concluded with a similar upper bound. Both works adopt an event-level privacy model, with the perturbation mechanism designed to protect the presence of an individual event, i.e. a user's contribution to the data stream at a single time point, rather than the presence or privacy of a user.

Our Contributions. In this paper, we propose a novel adaptive approach with sampling and estimation for releasing time series under differential privacy. It uses sampling to query and perturb selected values in the time series with the differential privacy mechanism, and simultaneously uses prediction and estimation to dynamically predict the non-sampled values and correct the sampled values. Applying perturbation only to sampled values reduces the overall perturbation noise under a given differential privacy constraint. The prediction and estimation aim to reduce the prediction error at non-sampled points and to reduce the impact of the perturbation noise at sampled points. The key innovation is that it utilizes feedback loops based on observed (perturbed) values to dynamically adjust the prediction/estimation model and the sampling rate. To this end, we examine two challenges in our system: predictability and controllability. The former raises the question: given a perturbed observation at each time point, can we formulate an estimate which is close to the true value and dynamically adjust the estimation model based on the current observation? The latter poses another question: supposing an accurate estimate can be derived at any time step, can we dynamically adjust the sampling rate according to the rate of data change? We propose a solution to address these two issues and we summarize our contributions below.

1) To improve the accuracy of data release at each timestamp, we propose to use the Kalman filter [13], which is widely adopted for signal recovery, to estimate the original data values based on observed values. By assuming a process model that generates the time series, one advantage of the Kalman filter is that it reduces the impact of perturbation errors introduced by the differential privacy mechanism.
This is achieved by linearly combining a prediction gener- 
ated by the process model and the perturbed observation. 
The combined value, referred to as aposteriori estimate, is 
a minimum variance estimate, which provides an educated 
guess rather than a pure perturbed value. The estimate is 
then fed back to the system for future predictions and for 
dynamically adjusting the sampling process. 

2) To minimize the overall privacy cost, and hence the overall perturbation error, we propose an adaptive sampling algorithm which adjusts the sampling rate using a PID controller. Without assuming a model for the sampling process, a PID controller, which is the most common form of feedback controller, is used to detect data dynamics from the estimates produced by the Kalman filter and to increase the sampling frequency when data is going through rapid changes. Figure 2 illustrates the idea of adaptive sampling. We plot the original time series, traffic count, as well as the number of queries issued by the adaptive sampling mechanism during each corresponding time unit. As shown, the number of queries issued by our adaptive approach increases between day 50 and day 100, when the traffic count exhibits significant fluctuations, and it drops beyond day 100, when there is little variation among the original data values.

3) We empirically study the accuracy and robustness of 
our approach with real time-series data sets. Our experi- 
ments show the proposed solution provides real-time accu- 
rate results and stability despite different data dynamics. 
We believe our solution is applicable to a wider range of 
applications. 

The rest of the paper is organized as follows: Section 2 provides the background for differential privacy and describes the baseline method and an existing approach. Section 3 presents an overview of our proposed solution. Sections 4 and 5 present the technical details of the Kalman filter based estimation and PID controller based adaptive sampling, respectively. Section 6 presents a set of experimental results. We review related existing works in Section 7. Finally, Section 8 concludes the paper and states possible directions for future work.

2. PRELIMINARIES 

In this section we will discuss problem setup, differential 
privacy that we use as our privacy definition, and also re- 
view existing perturbation techniques to achieve differential 
privacy on time-series data. 

2.1 Problem Statement 

We consider time-series data consisting of aggregate val- 
ues from a set of individuals (such as people visiting a par- 
ticular hospital). Formally we define a time series X as 
follows: 

Definition 1. [Aggregate Time Series] A univariate, discrete time series $X = \{x_k\}$ is a set of values of a variable x observed at discrete time k, with $0 \le k < T$, where T gives the lifetime of the series.

In the suggested applications, X is an aggregate count series,
such as the daily total of patients diagnosed with Influenza,
or the hourly count of drivers passing by a gas station. This
assumption will hold true for all baseline algorithms as well 
as our proposed approach. 

A trusted aggregator who has access to the entire time- 
series, when sharing it with others, needs to release a high 
quality version from the original series, yet without compro- 
mising the privacy of any individual participant. We mea- 
sure the quality of a published series by average relative 
error: 

Definition 2. [Average Relative Error] The average relative error, denoted by E, of a published series $R = \{r_k\}$ derived from the original time series $\{x_k\}$ is given by the following equation:

$$E = \frac{1}{T}\sum_{k=0}^{T-1} \frac{|r_k - x_k|}{\max\{x_k, \delta\}} \qquad (1)$$

where $\delta$ is a user-specified constant (also referred to as the sanitary bound, as in [25]) to mitigate the effect of excessively small query results. Here we assume that the sanitary bound remains the same throughout the entire time series.
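As an illustration of Definition 2, the following minimal sketch computes the average relative error for two equal-length series. The function name and the default sanitary bound of 1.0 are assumptions made for this example only; the definition leaves the bound to the user.

```python
def average_relative_error(released, original, delta=1.0):
    """Average relative error E of a released series against the original.

    `delta` is the sanitary bound; the default of 1.0 is an arbitrary
    placeholder for this sketch, not a value taken from the paper.
    """
    if len(released) != len(original) or not original:
        raise ValueError("series must be non-empty and of equal length")
    total = 0.0
    for r_k, x_k in zip(released, original):
        # The sanitary bound keeps tiny true values from dominating the error.
        total += abs(r_k - x_k) / max(x_k, delta)
    return total / len(original)
```

For example, releasing [12, 0] for the original series [10, 0] with a sanitary bound of 2 gives per-timestamp errors of 0.2 and 0, hence an average of 0.1.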

Clearly, the quality of a published series increases as each $r_k$ approaches $x_k$, the extreme case of which would have $r_k = x_k$ for each k. However, a privacy-preserving release is likely
to perturb original data values in order to protect individual 
privacy. Thus, a publishing mechanism that guarantees user 
privacy and yields high utility is desired. 

2.2 Differential Privacy and Background 

Informally, a mechanism is differentially private if its out- 
come is not significantly affected by the removal or addition 
of a single user. It ensures a user that any privacy breach 
will not be a result of participating in the database since 
anything that is learnable from the database with his record 
is also learnable from the one without his record. 

The formal definition of differential privacy [2], also referred to as Unbounded Differential Privacy by [14], is given as follows. Here the parameter $\alpha$ specifies the degree of privacy offered.

Definition 3. [Differential Privacy] A non-interactive privacy mechanism A gives $\alpha$-differential privacy if for any datasets $D_1$ and $D_2$ differing on at most one record, and for any possible anonymized dataset $\tilde{D} \in Range(A)$,

$$\Pr[A(D_1) = \tilde{D}] \le e^{\alpha} \times \Pr[A(D_2) = \tilde{D}] \qquad (2)$$

where the probability is taken over the randomness of A.

Laplace Mechanism. Dwork et al. [8] show that differential privacy can be achieved by adding i.i.d. noise to the result of each query. The magnitude of the noise added conforms to a Laplace distribution with the probability density function $p(x|\lambda) = \frac{1}{2\lambda}e^{-|x|/\lambda}$, where $\lambda$ is determined by both the desired privacy level $\alpha$ and the global sensitivity [8] of a query, which is defined below.

Definition 4. [Global Sensitivity] For any function $f : D \rightarrow \mathbb{R}^d$, the Global Sensitivity of f is

$$GS(f) = \max_{D_1, D_2} \|f(D_1) - f(D_2)\|_1 \qquad (3)$$

for all $D_1$, $D_2$ differing in at most one record. For instance, the sensitivity of a count query is 1.

Dwork et al. [8] prove that adding Laplace noise of magnitude $\lambda = GS(q)/\alpha$ to the true answer of query q guarantees $\alpha$-differential privacy.
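The Laplace mechanism described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, and the noise is drawn as the difference of two exponential variables, which is one standard way to sample a zero-mean Laplace distribution with the standard library.

```python
import random

def laplace_noise(scale, rng=random):
    """Draw one sample from a zero-mean Laplace distribution.

    The difference of two i.i.d. exponential variables with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def laplace_mechanism(true_answer, sensitivity, alpha, rng=random):
    """Perturb a query answer with noise of magnitude sensitivity / alpha."""
    if alpha <= 0 or sensitivity <= 0:
        raise ValueError("alpha and sensitivity must be positive")
    return true_answer + laplace_noise(sensitivity / alpha, rng)

# A count query (sensitivity 1) answered under a privacy level of 0.5;
# both numbers are illustrative.
noisy_count = laplace_mechanism(true_answer=120, sensitivity=1.0, alpha=0.5)
```

A smaller privacy level yields a larger noise magnitude, which is why spreading a fixed budget over many queries degrades each individual answer.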

Composition. The composition properties of differential privacy provide privacy guarantees for a sequence of computations. Any sequence of computations that each provides differential privacy in isolation also provides differential privacy in sequence, which is known as sequential composition [19].

Theorem 1. [19] Let $A_i$ each provide $\alpha_i$-differential privacy. A sequence of $A_i(D)$ over the dataset D provides $(\sum_i \alpha_i)$-differential privacy.
In a special case called parallel composition [19], the computations operate on disjoint subsets of the data D. The ultimate privacy guarantee then depends only on the worst guarantee among the individual computations, rather than on their sum. However, it is not applicable in our context since the data D at successive timestamps can be highly correlated.

Given the sequential composition, an overall privacy requirement $\alpha$ can be considered as a privacy budget and needs to be allocated among all the queries in a data release mechanism in order to guarantee $\alpha$-differential privacy of the released data.

User-level privacy vs. event-level privacy. The work [9] proposed a differentially private continual counter with the notion of event-level privacy, where the neighboring databases differ at $u_i$, a user u's contribution at timestamp i. In our study, we provide a stronger privacy guarantee, user-level privacy, where the neighboring databases differ at the user u, i.e. u's contribution at all timestamps, thus protecting sensitive information about user u at any time.

2.3 Existing Solutions

Here we discuss the baseline Laplace perturbation algorithm and a recently proposed DFT algorithm. Empirical comparisons of both with our proposed solution are included in Section 6.

Laplace Perturbation Algorithm. The standard Laplace mechanism can be applied to perturb data values at each timestamp. In other words, T recurring queries (count queries, in our setting) can be issued, one at each timestamp. Due to the composition theorem, if each individual query is $\alpha/T$-differentially private, the sequence of queries guarantees $\alpha$-differential privacy, which leads to a total perturbation error of the magnitude of $\Theta(T)$. Figure 3(a) illustrates the workflow of this naive Laplace Perturbation Algorithm. The detailed algorithm is shown in Algorithm 1.

Algorithm 1 Laplace Perturbation Algorithm (LPA)

Input: Raw time-series X; privacy budget $\alpha$
Output: Released time-series R

for each $k \in \{0, 1, ..., T-1\}$ do
    draw noise from $Lap(T/\alpha)$;
    $r_k = x_k + noise$;
return R;

Discrete Fourier Transform. Rastogi and Nath [23] proposed the Fourier Perturbation Algorithm $FPA_k$ that transforms the raw series into the frequency domain, perturbs the first k coefficients, and then reconstructs the series from the perturbed k coefficients. Figure 3(b) illustrates the main idea of their method. The outline is shown in Algorithm 2. It begins by computing $F^k$, which is composed of the first k Fourier coefficients in the Discrete Fourier Transform (DFT) of X, where the j-th coefficient is given as $DFT(X)_j = \sum_{i=0}^{T-1} e^{\frac{2\pi\sqrt{-1}}{T} j i} x_i$. Then it perturbs $F^k$ using the LPA algorithm with privacy budget $\alpha$, getting a noisy estimate $\tilde{F}^k$. This perturbation is to guarantee differential privacy. Denote by $PAD^T(\tilde{F}^k)$ the sequence of length T obtained by appending $T - k$ zeros to $\tilde{F}^k$. The algorithm finally computes the Inverse Discrete Fourier Transform (IDFT) of $PAD^T(\tilde{F}^k)$ to get R. The j-th element of the inverse is given as $IDFT(X)_j = \frac{1}{T}\sum_{i=0}^{T-1} e^{-\frac{2\pi\sqrt{-1}}{T} j i} X_i$.

Algorithm 2 Discrete Fourier Transform (DFT)

Input: Raw time-series X; privacy budget $\alpha$
Output: Released time-series R

compute $F^k = DFT^k(X)$;
compute $\tilde{F}^k = LPA(F^k, \alpha)$;
compute $R = IDFT(PAD^T(\tilde{F}^k))$;
return R;

Figure 3: Illustration of Two Existing Methods
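The two existing methods can be sketched as follows. This is a hedged illustration rather than the reference implementation of either algorithm: the function names are hypothetical, the Fourier noise scale is left as a caller-supplied parameter (`noise_scale`) because its calibration to the privacy budget is not reproduced here, noise is assumed to be added independently to the real and imaginary parts of each coefficient, and only the real part of the inverse transform is released.

```python
import cmath
import random

def laplace_noise(scale, rng):
    """Zero-mean Laplace sample as a difference of two exponentials."""
    if scale == 0:
        return 0.0
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def lpa(series, alpha, rng):
    """Baseline LPA: add Lap(T / alpha) noise to each of the T values."""
    scale = len(series) / alpha
    return [x + laplace_noise(scale, rng) for x in series]

def dft_first_k(series, k):
    """First k coefficients, using a positive exponent in the forward DFT."""
    T = len(series)
    return [sum(cmath.exp(2j * cmath.pi * j * i / T) * x
                for i, x in enumerate(series)) for j in range(k)]

def idft(coeffs):
    """Inverse transform with the matching negative exponent and 1/T factor."""
    T = len(coeffs)
    return [sum(cmath.exp(-2j * cmath.pi * j * i / T) * c
                for i, c in enumerate(coeffs)) / T for j in range(T)]

def fpa(series, k, noise_scale, rng):
    """Transform, perturb the first k coefficients, zero-pad, and invert."""
    T = len(series)
    noisy = [c + complex(laplace_noise(noise_scale, rng),
                         laplace_noise(noise_scale, rng))
             for c in dft_first_k(series, k)]
    padded = noisy + [0j] * (T - k)          # append T - k zeros
    return [value.real for value in idft(padded)]
```

With `k` equal to the series length and a noise scale of zero, `fpa` reproduces the input up to floating-point error. Choosing a smaller `k` trades perturbation error for reconstruction error, and the whole series must be available before anything can be released.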

3. OVERVIEW OF OUR SOLUTION 

In this section, we propose a novel solution to sharing 
time-series data with differential privacy. It allows for fully 
automated adaptation to changing data dynamics and highly 
accurate time-series prediction/estimation. 

3.1 Framework 

It is intuitive that a good adaptation scheme is likely to is- 
sue more queries to the original time-series when data value 
is going through rapid variations and fewer queries when the 
data curve is relatively flat. Furthermore, this information 
of data trend can be inferred from historical queries at no ex- 
tra cost. Therefore, we propose a feedback control solution 
which utilizes the past query results to make decisions about 
the future sampling strategy. In addition, a Kalman filter is
used to predict the data values at non-sampling points and 
correct the predictions based on the observed (perturbed) 
values at sampling points. 

The framework of our solution is shown in Figure 4. In a
feedback loop presentation, the framework is composed of in- 
put, perturbation mechanism, estimation based on Kalman 
filter, adaptive sampling based on PID controller, and out- 
put. 

• The input is a streaming time-series with one aggre- 
gated value at a time. It is sampled by the PID con- 
troller. As a result, not every data value is queried 
from the stream. 

• The Laplace mechanism perturbs each data value 
that actually is sampled by the system in order to 
guarantee differential privacy. 
Figure 4: Closed chain control loop of our solution



• At each timestamp, the Kalman filter Prediction pro- 
cedure produces an apriori prediction of the time- 
series based on an internal state model. Upon re- 
ceiving a noisy value from the Laplace perturbation 
at sampling points, the Correction procedure is ac- 
tivated to generate an aposteriori estimate, which is 
achieved by correcting the apriori prediction with the 
observation. 

• The error, between the prior estimate and the poste- 
rior estimate, is then fed through the adaptive sam- 
pling module with PID controller to determine a 
new sampling rate. 

• The output is a streaming time-series with aposteriori 
estimates at sampling points and apriori estimates at 
non-sampling points, both generated by the Kalman 
filter. 

There are two types of error which we would like to bal- 
ance in our solution: perturbation error and prediction er- 
ror. The perturbation error is introduced by Laplace pertur- 
bation mechanism at sampling points, while the prediction 
error is introduced by the Kalman filter Prediction proce- 
dure at non-sampling points. Clearly, the more we sample, 
the larger perturbation error is introduced, but the predic- 
tion error is reduced due to increasing feedback and vice 
versa. The baseline Laplace perturbation algorithm (Algorithm 1) introduces high perturbation error by querying at every timestamp. The DFT algorithm (Algorithm 2) mitigates the perturbation error at the cost of reconstruction error when recovering the time series from the Inverse DFT.
In contrast, our goal is to balance the trade-off between these 
two types of error by adaptively adjusting the sampling rate. 

3.2 Algorithm 

Here we present an overall description of our solution in Algorithm 3. We prove that it is $\alpha$-differentially private. More details about the Kalman filter and the PID controller will be discussed in the next two sections.

Queries are issued through the Laplace mechanism for the first $T_1$ timestamps to collect enough feedback for the PID controller (Lines 3-6 in Algorithm 3). Line 7 initializes the sampling interval with a predefined minimum value minI. For each time step after the first $T_1$, a prior prediction is generated by the KFPredict procedure.

The feedback control loop is implemented by Lines 8 to 17. If the current time step is a sampling point (k % interval = 0) and there is budget left for more queries (queryCount < M, where M is a predefined bound on the maximum number of queries allowed), a noisy measurement is retrieved from Laplace perturbation (Line 11), and it is used by KFCorrect to obtain an updated estimate posterior. Line 13 publishes the posterior. A new interval is determined by the PIDControl output in Line 14. Note that we always keep the interval at or above the minimum length. If either condition in Line 10 evaluates to false, Line 17 publishes the prediction prior to save the privacy budget.

Algorithm 3 Adaptive time-series release algorithm
Input: Raw time-series X; privacy budget $\alpha$
Output: Released time-series R

1: k, queryCount ← 0;
2: $\lambda$ ← $M/\alpha$;
3: while queryCount < $T_1$ do
4:     draw noise ← $Lap(\lambda)$;
5:     release $r_k$ ← $x_k$ + noise;
6:     k ← k + 1, queryCount ← queryCount + 1;
7: interval ← minI;
8: for each $k \in \{T_1, T_1 + 1, ..., T - 1\}$ do
9:     obtain estimate prior from KFPredict(k);
10:    if k % interval = 0 and queryCount < M do
11:        $z_k$ ← perturb $x_k$ by $Lap(\lambda)$;
12:        obtain estimate posterior from KFCorrect(k);
13:        release $r_k$ ← posterior;
14:        interval ← max{PIDControl(k), minI};
15:        queryCount ← queryCount + 1;
16:    else
17:        release $r_k$ ← prior;
18: return R;

Theorem 2. Algorithm 3 satisfies $\alpha$-differential privacy.

PROOF. Each query issued through the Laplace mechanism is $\alpha/M$-differentially private. The maximum number of queries allowed is M. By the composition rule (Theorem 1), Algorithm 3 satisfies $\alpha$-differential privacy.
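The budget accounting behind this proof can be made concrete with a small sketch. The class below is a hypothetical helper, not part of the algorithm as published: it assumes count queries of sensitivity 1, gives every query an equal share of the budget, and refuses to answer once the maximum number of queries has been issued, so the total privacy cost never exceeds the overall budget.

```python
import random

class BudgetedLaplaceSampler:
    """Answers at most `max_queries` count queries under a total budget."""

    def __init__(self, alpha, max_queries, seed=None):
        if alpha <= 0 or max_queries <= 0:
            raise ValueError("alpha and max_queries must be positive")
        self.per_query = alpha / max_queries   # budget spent by one query
        self.scale = max_queries / alpha       # Laplace magnitude, sensitivity 1
        self.max_queries = max_queries
        self.query_count = 0
        self._rng = random.Random(seed)

    def can_query(self):
        """True while fewer than `max_queries` queries have been issued."""
        return self.query_count < self.max_queries

    def spent(self):
        """Privacy cost so far, by sequential composition."""
        return self.query_count * self.per_query

    def query(self, true_count):
        """Return a perturbed count, or raise once the budget is exhausted."""
        if not self.can_query():
            raise RuntimeError("privacy budget exhausted")
        self.query_count += 1
        rate = 1.0 / self.scale
        noise = self._rng.expovariate(rate) - self._rng.expovariate(rate)
        return true_count + noise
```

After exactly `max_queries` calls to `query`, `spent()` equals the overall budget and every later time step has to be served by a prediction instead of a new measurement.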

4. ESTIMATION 

In this section, we present the formal Kalman filter model 
we adopt for estimating data values at each timestamp and 
describe the details of the Kalman filter prediction and cor- 
rection procedures. 

4.1 The Kalman Filter Model 

The Kalman filter was introduced in 1960 by R. E. Kalman [13] 
as a recursive solution to the discrete data linear filtering 
problem. Since then, it has found application in the fields of 
data smoothing, process estimation, and object tracking, to
name a few. The Kalman filter addresses the general prob- 
lem of trying to estimate the internal state of a discrete-time 
controlled process that is governed by a linear stochastic 
difference equation. Since the internal state of a process is 
usually unavailable, the Kalman filter estimates the process 
using feedback control: the filter estimates the process state 
according to the linear equation at some time and it obtains 
feedback in the form of (noisy) measurements. It aligns 
perfectly with our time-series sharing scenario where only 
perturbed data values are available. We are then inspired to model the time-series data with a process model, and to treat observations from the Laplace perturbation mechanism as noisy measurements. We now introduce the formal Kalman filter model in our context.

Process Model. In the time-series publishing scenario, we adopt a constant process model for the time series, which is given by the following equation:

$$x_{k+1} = x_k + \omega \qquad (4)$$

where k is the discrete time index. This constant system model states that adjacent data values from the original time series should be consistent except for a white Gaussian process model noise $\omega$:

$$p(\omega) \sim N(0, Q) \qquad (5)$$

where Q is the covariance of $\omega$.

Measurement Model. The observation, obtained by Line 11 in Algorithm 3, is perturbed data from the Laplace mechanism, and can be modeled by:

$$z_k = x_k + \nu \qquad (6)$$

where $\nu$, the measurement noise, is a Laplacian noise which follows:

$$p(\nu) \sim Lap(0, \lambda) \qquad (7)$$

where $\lambda$ is the magnitude parameter determined by the differential privacy mechanism.

For computational efficiency and by the guidance of common practice, we will use a small, white Gaussian error to approximate the actual noise distribution $Lap(\lambda)$. Thus we define the distribution of $\nu$ as

$$p(\nu) \sim N(0, R) \qquad (8)$$

where R is the covariance of $\nu$.

Prior and Posterior Estimates. At time step k, the apriori state estimate $\hat{x}_k^-$ is made based on the system model (4) and is related to the aposteriori state estimate of the last step:

$$\hat{x}_k^- = \hat{x}_{k-1}. \qquad (9)$$

The aposteriori state estimate $\hat{x}_k$ is based on a linear combination of the apriori state estimate $\hat{x}_k^-$ and a weighted prediction error $\psi_k$. The aposteriori estimate is calculated as follows:

$$\hat{x}_k = \hat{x}_k^- + K_k \psi_k. \qquad (10)$$

This error $\psi_k$ is called the innovation, which captures the difference between the actual measurement and the measurement prediction. It is given as

$$\psi_k = z_k - \hat{x}_k^-. \qquad (11)$$

The value of the weight $K_k$ is called the Kalman Gain, which is adjusted with each measurement. In order to reduce the uncertainty of the aposteriori estimate $\hat{x}_k$, $K_k$ is chosen to minimize the aposteriori error covariance $P_k$, which is defined as

$$P_k = E[(x_k - \hat{x}_k)(x_k - \hat{x}_k)^T]. \qquad (12)$$

Symmetrically, we define the apriori error covariance $P_k^-$ as

$$P_k^- = E[(x_k - \hat{x}_k^-)(x_k - \hat{x}_k^-)^T]. \qquad (13)$$




Prediction:
(1) Project the state ahead: $\hat{x}_k^- = \hat{x}_{k-1}$
(2) Project the error covariance ahead: $P_k^- = P_{k-1} + Q$

Correction:
(1) Compute the Kalman gain: $K_k = P_k^-(P_k^- + R)^{-1}$
(2) Update estimate with measurement: $\hat{x}_k = \hat{x}_k^- + K_k(z_k - \hat{x}_k^-)$
(3) Update the error covariance: $P_k = (1 - K_k)P_k^-$

Figure 5: A complete picture of the Kalman filter

Applying the least-square method to (12), we get

$$K_k = P_k^-(P_k^- + R)^{-1}. \qquad (14)$$

By (10,12,13), we get

$$P_k = (1 - K_k)P_k^-. \qquad (15)$$

The projection of the apriori error covariance can be derived given (4,9,12,13) and $P_{k-1}$, resulting in

$$P_k^- = P_{k-1} + Q. \qquad (16)$$

4.2 Prediction and Correction 

The sheer advantage of the Kalman filter lies in the fact 
that it maintains and updates the best estimate of the inter- 
nal state by properly weighing and combining all available 
data (state prior and noisy observations by the differential 
privacy mechanism) to form an educated guess. It accom- 
plishes that by repeating the following two mechanisms [24] : 

• Prediction: This mechanism projects forward in time 
the current state and error estimates to obtain the 
apriori estimates for the next step. At time step k, 
the filter predicts the values of the internal state and 
the error covariance at time step k + 1.

• Correction: This mechanism is responsible for the feed- 
back, i.e. for incorporating a new measurement into 
the apriori estimate to obtain an improved aposteriori 
estimate. At time step k when an actual measure- 
ment is available, the filter corrects itself based on the 
innovation. 

Figure 5 gives a high-level diagram of the two operations of the Kalman filter. After each prediction and correction pair, the process is repeated with the previous aposteriori estimate $\hat{x}_{k-1}$ used to project or predict the new apriori estimate. The prediction mechanism is implemented by KFPredict in Algorithm 4. Note that we always derive the apriori state estimate using the most recent published value $r_{k-1}$. The correction mechanism is implemented by KFCorrect in Algorithm 5. Note that we obtain the measurement $z_k$ as the noisy observation from the Laplace perturbation mechanism.

Since each noisy observation from Laplace mechanism comes 
with a cost (privacy budget spent), we are motivated to sam- 
ple data values through the differential privacy interface only 
when needed in our overall solution. As a result, a noisy ob- 
servation, which is crucial to the Correction step, may not 
be available at all times. Therefore, we propose to release 
the prior estimate when the observation is absent and to 
correct the released value when the observation is available. 
The detailed sampling strategy is described in the next section.



Algorithm 4 KFPredict(k)

Input: Previous published value $r_{k-1}$; aposteriori error covariance $P_{k-1}$

Output: Apriori state estimate $\hat{x}_k^-$, error covariance $P_k^-$

1: $\hat{x}_k^-$ ← $r_{k-1}$;
2: $P_k^-$ ← $P_{k-1} + Q$;
3: return ($\hat{x}_k^-$, $P_k^-$);



Algorithm 5 KFCorrect(k)

Input: Apriori state estimate $\hat{x}_k^-$; measurement $z_k$; apriori error covariance $P_k^-$

Output: Aposteriori state estimate $\hat{x}_k$, aposteriori error covariance $P_k$

1: $K_k$ ← $P_k^-(P_k^- + R)^{-1}$;
2: $\hat{x}_k$ ← $\hat{x}_k^- + K_k(z_k - \hat{x}_k^-)$;
3: $P_k$ ← $(1 - K_k)P_k^-$;
4: return ($\hat{x}_k$, $P_k$);
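For a scalar count series, the two procedures reduce to a handful of arithmetic operations. The following minimal sketch mirrors Algorithms 4 and 5; the function and parameter names are hypothetical, and the process and measurement noise variances (`Q` and `R`) are inputs whose values the sketch does not prescribe.

```python
def kf_predict(prev_release, prev_cov, Q):
    """Prediction: the prior equals the previous released value,
    and the error covariance grows by the process noise Q."""
    prior = prev_release
    prior_cov = prev_cov + Q
    return prior, prior_cov

def kf_correct(prior, prior_cov, z, R):
    """Correction: blend the prior with the noisy measurement z."""
    gain = prior_cov / (prior_cov + R)        # Kalman gain, scalar case
    posterior = prior + gain * (z - prior)    # weight the innovation z - prior
    posterior_cov = (1.0 - gain) * prior_cov
    return posterior, posterior_cov

# One prediction/correction pair with illustrative numbers.
prior, prior_cov = kf_predict(prev_release=10.0, prev_cov=1.0, Q=1.0)
posterior, posterior_cov = kf_correct(prior, prior_cov, z=14.0, R=2.0)
```

In this example the prior covariance equals R, so the gain is 0.5 and the posterior lands halfway between the prediction (10) and the measurement (14). A larger R, i.e. a noisier measurement, pulls the posterior toward the prediction.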



5. SAMPLING STRATEGY 

In this section, we present our adaptive sampling strategy 
for issuing queries. To motivate and facilitate our discus- 
sion, we first present a fixed-rate sampling strategy, and 
then describe the details of the adaptive sampling strategy 
based on PID control which aims to achieve near optimal
sampling interval based on the data dynamics. 

5.1 Fixed Rate Sampling 

A naive approach is to use fixed-rate sampling and to predict at non-query points. Given an interval I, the fixed-rate algorithm samples the time series periodically and publishes the perturbed value every I time units. As for the time points between two adjacent queries, an estimate of the data value is published.

Algorithm 6 shows a sketch of the algorithm. The variable M represents the total number of queries allowed and the individual budget for each query is equivalent to $\alpha/M$. Line 10 indicates that for the non-sampling points, we use the prior estimate given by the Kalman filter.

A sampling strategy making M queries in total leads to an overall perturbation error of $\Theta(M)$, rather than the $\Theta(T)$ ($T > M$) of the baseline Laplace perturbation algorithm. Although we are not able to give a bound for the prediction error as the time series we consider is very generic, we may consider that it is in relation to the number of non-sampling points $T - M$.

The challenge of fixed-rate sampling is in defining the sampling rate, i.e. the total number of samples allowed, M. When we increase the sampling rate, i.e. when M is high, the overall perturbation error grows with M; the extreme case is to issue a query at each time step, as in the baseline Algorithm 1. On the other hand, when we decrease the sampling rate, i.e. when M is low, the perturbation at each sampling point drops, but the published series does not reflect up-to-date data values, resulting in a large prediction error. Apriori knowledge of the data is required to find the optimal sampling rate in order to better describe the real-time data trend and minimize the average relative error. However, that is impractical for a real-time publishing scenario.



Algorithm 6 Fixed Rate Sampling

Input: Raw time-series X; privacy budget α; fixed query interval I
Output: Perturbed publish R

 1: M ← X.length / I;
 2: λ ← M / α;
 3: for each k ∈ {0, 1, ..., T - 1} do
 4:     prior ← KFPredict(k);
 5:     if k % I = 0 do
 6:         perturb x_k by Lap(λ);
 7:         obtain estimate posterior from KFCorrect(k);
 8:         r_k ← posterior;
 9:     else
10:         r_k ← prior;
11: return R;
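The fixed-rate procedure can be sketched in Python as follows. This is a minimal illustration, not the authors' Java implementation: the filter is the textbook scalar Kalman filter with a constant state model, the initial values `x0` and `p0` are hypothetical, and the number of queries is rounded up so that the per-query budget α/M is never overspent. All class and function names are invented for this example.

```python
import math
import random


class ScalarKalmanFilter:
    """Scalar Kalman filter with a constant state model (textbook form, assumed)."""

    def __init__(self, q: float, r: float, x0: float = 0.0, p0: float = 1.0):
        self.q = q    # process noise covariance Q
        self.r = r    # measurement noise covariance R
        self.x = x0   # current state estimate (hypothetical initial value)
        self.p = p0   # current error variance (hypothetical initial value)

    def predict(self) -> float:
        # Constant model: the prior equals the previous estimate,
        # while the error variance grows by Q.
        self.p += self.q
        return self.x

    def correct(self, z: float) -> float:
        gain = self.p / (self.p + self.r)   # Kalman gain
        self.x += gain * (z - self.x)       # blend prior and noisy observation
        self.p *= 1.0 - gain
        return self.x


def laplace_noise(scale: float, rng: random.Random) -> float:
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def fixed_rate_release(series, alpha, interval, q=0.001, r=10.0, seed=None):
    """Return (released series, number of queries issued)."""
    rng = random.Random(seed)
    num_queries = math.ceil(len(series) / interval)  # queries at k = 0, I, 2I, ...
    scale = num_queries / alpha                      # lambda = M / alpha
    kf = ScalarKalmanFilter(q, r)
    released = []
    for k, x_k in enumerate(series):
        prior = kf.predict()
        if k % interval == 0:
            z_k = x_k + laplace_noise(scale, rng)    # perturbed query answer
            released.append(kf.correct(z_k))         # publish the posterior
        else:
            released.append(prior)                   # publish the prior
    return released, num_queries
```

Under the constant state model, every value published between two sampling points simply repeats the most recent posterior, which is why long intervals let prediction error accumulate.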



With no a priori knowledge of the time series, it is desirable
and necessary to detect data dynamics and to adjust
the sampling rate on the fly. For instance, when the data
values appear stable based on previous samples, the sampling
rate should decrease to save privacy budget for the
future. On the other hand, if the data is going through rapid
changes, the sampling rate should increase to capture the
changes and to improve utility.

5.2 Adaptive Sampling with Control 

We adopt a controller in our overall solution to adaptively 
adjust the rate of sampling. Here we introduce the idea of 
feedback control and present details of the PID controller 
for the time-series application. 

Feedback Control. A typical feedback control system 
starts with a measurement of the system output. The mea- 
surement is then compared with a desired value to generate 
the tracking error. The error is fed through a controller to 
generate a control action which will be further implemented 
into the input to the system. Then the system output will 
be monitored by the sensor to start another iteration. 

The notation System is the process to regulate. In the 
adaptive sampling module, it represents the sampling pro- 
cess. The Controller component specifies the manner with 
which the error information is utilized to adjust the sampling 
rate. We describe the error feedback and the PID controller 
used below. 

Error Feedback. The feedback to the controller in our solution
is the relative error between the a posteriori estimate
and the a priori estimate at a particular time step. At time
step k_n (0 ≤ k_n < T), where the subscript indicates the
n-th sampling point (0 ≤ n < M), we define this error E_{k_n}
as follows:

Definition 5. [Model Error] At a given time step k_n, the
error E_{k_n} is defined as

$$E_{k_n} = \frac{|\hat{x}_{k_n} - \hat{x}^-_{k_n}|}{\max\{\hat{x}_{k_n}, \delta\}}$$

Note that the posterior estimate is only available when given
a noisy observation at time step k_n. Thus no error is defined
at non-sampling points.

The model error defined above is based on the innovation error
amplified by the Kalman gain, according to Equation (10). Clearly,
the model error measures how well the internal state model
describes the current data dynamics, supposing the a posteriori
estimate is close to the true value. Since the state prior
is given by a constant state model, we may infer that the data
is going through rapid changes if the error E_{k_n} increases
with time. In response, the controller in our system
detects the errors and adjusts the sampling rate accordingly.
We now take a closer look at how the controller acts.
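The link between the model error and the innovation can be spelled out in one line. The derivation below assumes that Equation (10) has the standard Kalman update form, with \hat{x}^-_{k_n} denoting the prior, \hat{x}_{k_n} the posterior, z_{k_n} the noisy observation, and K_{k_n} the (non-negative) Kalman gain; this notation is an assumption of the sketch.

```latex
\hat{x}_{k_n} = \hat{x}^-_{k_n} + K_{k_n}\,\bigl(z_{k_n} - \hat{x}^-_{k_n}\bigr)
\quad\Longrightarrow\quad
E_{k_n}
  = \frac{\bigl|\hat{x}_{k_n} - \hat{x}^-_{k_n}\bigr|}{\max\{\hat{x}_{k_n}, \delta\}}
  = \frac{K_{k_n}\,\bigl|z_{k_n} - \hat{x}^-_{k_n}\bigr|}{\max\{\hat{x}_{k_n}, \delta\}}
```

Read this way, a large model error means either that the observation disagrees strongly with the prediction or that the filter currently places a lot of weight on observations.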

PID Controller. The PID control algorithm is the most
common form of feedback control and the foundation of almost
all basic control applications. King [15] defines
three types of action by a PID controller: Proportional,
Integral, and Derivative. For our application, we design a PID
controller that measures the performance of our sampling
process over time in terms of a compound error. We now
redefine its three components.

• Proportional error keeps the controller output
(Δ) in proportion to the current error E_{k_n}, with k_n
being the current time step and the subscript n being the
sampling point index:

$$\Delta_p = C_p E_{k_n} \qquad (18)$$

where C_p denotes the proportional gain, which amplifies
the current error.

• Integral error eliminates offset by making the rate
of change of the control output proportional to the error.
With similar terms, we define the integral control as
follows:

$$\Delta_i = \frac{C_i}{T_i} \sum_{j=n-T_i+1}^{n} E_{k_j} \qquad (19)$$

where C_i denotes the integral gain amplifying the integral
error and T_i represents the integral time, indicating
how many recent errors are taken.

• Derivative error attempts to prevent large errors in
the future by changing the output in proportion to the
rate of change of the error:

$$\Delta_d = C_d \frac{E_{k_n} - E_{k_{n-1}}}{k_n - k_{n-1}} \qquad (20)$$

where C_d is the derivative gain amplifying the derivative
error.

The full PID algorithm we have developed so far is thus

$$\Delta = C_p E_{k_n} + \frac{C_i}{T_i} \sum_{j=n-T_i+1}^{n} E_{k_j} + C_d \frac{E_{k_n} - E_{k_{n-1}}}{k_n - k_{n-1}} \qquad (21)$$

Control gains C_p, C_i, and C_d denote how much each of the
proportional, integral, and derivative terms counts toward the
final calibrated PID error. We further constrain them to be
non-negative and to sum to 1:

$$C_p, C_i, C_d \geq 0 \qquad (22)$$
$$C_p + C_i + C_d = 1 \qquad (23)$$

The three gains as well as the integral time T_i are system
parameters to be set in our application.

Algorithm 7 PIDControl(k_n)

Input: Feedback errors E_{k_n}
Output: New interval I'

1: Δ ← C_p E_{k_n} + (C_i / T_i) Σ_{j=n-T_i+1}^{n} E_{k_j} + C_d (E_{k_n} - E_{k_{n-1}}) / (k_n - k_{n-1});
2: I' ← max{I + θ(1 - e^(…)), minI};
3: return I';



Given the PID error Δ, a new sampling interval I' can be
determined by the following equation:

$$I' = I + \theta\,\bigl(1 - e^{\ldots}\bigr) \qquad (24)$$

where θ and ξ are pre-determined parameters.

A procedure that implements the PID controller is shown
in Algorithm 7. Notice that we keep the sampling interval
above a minimum threshold minI. An overall algorithm
with adaptive sampling was shown in Algorithm 3. As our
approach might issue fewer than M queries, the overall
perturbation error introduced can be bounded by O(M).
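A controller along these lines might look like the sketch below. It combines the proportional, integral, and derivative terms and then updates the interval. Two points are assumptions of this sketch rather than statements from the text: the exponent of the interval update is taken to be (Δ - ξ)/ξ, so that the interval grows when the PID error is below ξ and shrinks when it is above, and errors missing from the integral window are treated as zero. The class name `PIDController` and the parameter values used afterwards are hypothetical.

```python
import math
from collections import deque


class PIDController:
    """Sketch of a PID-driven sampling-interval update (illustrative only)."""

    def __init__(self, cp, ci, cd, integral_time, theta, xi, min_interval=1.0):
        if min(cp, ci, cd) < 0 or not math.isclose(cp + ci + cd, 1.0):
            raise ValueError("gains must be non-negative and sum to 1")
        self.cp, self.ci, self.cd = cp, ci, cd
        self.integral_time = integral_time
        self.theta, self.xi = theta, xi
        self.min_interval = min_interval
        self.errors = deque(maxlen=integral_time)  # the last T_i model errors
        self.last = None  # (time step, error) at the previous sampling point

    def pid_error(self, k, error):
        """Combine the three terms for the sampling point at time step k."""
        self.errors.append(error)
        proportional = self.cp * error
        # Dividing by T_i even when fewer errors are stored treats the
        # missing ones as zero (an assumption of this sketch).
        integral = self.ci * sum(self.errors) / self.integral_time
        derivative = 0.0
        if self.last is not None and k > self.last[0]:
            derivative = self.cd * (error - self.last[1]) / (k - self.last[0])
        self.last = (k, error)
        return proportional + integral + derivative

    def next_interval(self, interval, delta):
        """Update the sampling interval from the PID error delta."""
        # ASSUMPTION: exponent (delta - xi) / xi; capped to avoid overflow.
        exponent = min((delta - self.xi) / self.xi, 700.0)
        step = self.theta * (1.0 - math.exp(exponent))
        return max(interval + step, self.min_interval)
```

For instance, with the hypothetical settings `PIDController(0.9, 0.1, 0.0, integral_time=5, theta=10.0, xi=0.5)`, a PID error equal to ξ leaves the interval unchanged, a smaller error lengthens it, and a much larger error collapses it to the minimum interval.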

6. EXPERIMENT 

We have implemented our approach in Java with JSC
(Java Statistical Classes) for simulating the Laplace
distribution. We empirically study the predictability and
controllability of our proposed approach and compare it with
alternative methods in terms of utility.

Our study has been conducted with three real time-series 
data sets: flu, traffic, and unemployment. 

• Flu is from the weekly surveillance data of Influenza-like
illness provided by the Influenza Division of the
Centers for Disease Control and Prevention. We collected
the weekly outpatient count of the age group
[5-24] from 2006 to 2010. This time series consists of
209 data points.

• Traffic is a daily traffic count data set for Seattle-area
highway traffic monitoring and control provided by the
Intelligent Transportation Systems Research Program
at the University of Washington. We chose the traffic
count at location I-5 143.62 southbound from April
2003 till October 2004. This time series consists of
540 data points.

• Unemployment describes the monthly unemployment
level of black or African American women of the age
group [16,19] from the St. Louis Federal Reserve Bank.
This data set contains observations from January 1972
to October 2011 with 478 data points.

Table 2 summarizes the default setting of parameters used
in our experiments. Unless otherwise specified, the listed
parameters take on their corresponding default values. This is
not an optimal way of setting parameters. However, we have
been able to observe excellent performance from our approach.
We will study the impact of individual parameters in the next
few sections.



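Java Statistical Classes (JSC): http://www.jsc.nildram.co.uk
Centers for Disease Control and Prevention: http://www.cdc.gov/flu/
Intelligent Transportation Systems Research Program, University of Washington: http://www.its.washington.edu/
St. Louis Federal Reserve Bank: http://research.stlouisfed.org/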
Figure 6: Prediction Accuracy with three data sets



Table 2: Experiment Default Setting

Parameter    Default Value
δ            1
α            0.01
Q            0.001
R            10
C_p          0.9
C_i          0.1
C_d          0
M            25% × T
T_i          5
θ            10
ξ            1000
minI         1



6.1 Accuracy and Robustness of Prediction 

We first evaluate the accuracy of the Kalman filter estimation
across all three data sets, compared against a simple
linear predictor and pure Laplace perturbation from
Algorithm 1. To this end, the experiment is set up
such that a noisy observation is obtained at each time
step. The Kalman filter repeats the prediction-correction
pair and releases the posterior estimate at each time step.
The linear predictor generates a prediction based on the
linear model but releases pure perturbed answers without
correction. The relative error is calculated by comparing the
estimate/prediction against the true value, and the average
relative error, defined by Equation (1), over the entire
series is plotted in the results. We measure the performance
with respect to different scales of the privacy budget α. The
linear predictor is designed to fit a line through 2 data points,
as the data sets show little linearity at a larger scale. We set
the process noise covariance Q, defined in Equation (5),
to 0.001 and the measurement noise covariance R,
defined in Equation (8), to 10, since we assume a constant
state model and small measurement noise. As shown in
Figure 6, the linear predictor is consistently worse than pure
Laplace perturbation because of the extra linear model error. The
Kalman filter outperforms the linear predictor as well as the
Laplace perturbation algorithm across all three data sets, especially
when given small privacy budgets (α = 0.0001, 0.001, 0.01).
With a large budget (α = 1), which we note does not provide
sufficient privacy protection, we observed no substantial
advantage in using the Kalman filter. This can be explained
by the nature of the a posteriori estimate defined by
Equation (10): it only partially relies on the noisy measurement
z_k and hence does not fully reflect the reduced perturbation error.

(a) Impact of Q on Prediction Accuracy   (b) Impact of R on Prediction Accuracy

Figure 7: Impact of Noise Parameters on Prediction
Accuracy with Unemployment data set
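The evaluation metric used throughout these experiments can be sketched as follows. Equation (1) is not restated in this section, so the form below (absolute error divided by max{x_k, δ}, averaged over the series, mirroring the denominator of Definition 5) is an assumption, as are the function name and the default δ = 1 taken from Table 2.

```python
def average_relative_error(released, truth, delta=1.0):
    """Mean of |r_k - x_k| / max(x_k, delta) over the whole series.

    The bound delta keeps the ratio finite when a true value is zero or
    very small (assumed form of the metric, see the note above).
    """
    if len(released) != len(truth) or not truth:
        raise ValueError("series must be non-empty and of equal length")
    total = sum(abs(r - x) / max(x, delta) for r, x in zip(released, truth))
    return total / len(truth)
```

A released series that is off by 10 at every point of a true series of constant value 100 would score 0.1 under this definition.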

6.2 Effects of Kalman Filter Parameters 

To understand the impact of process noise covariance Q 
and measurement noise covariance R on the accuracy of 
Kalman filter estimation, we vary their values independently 
when performing the prediction experiment. We present the 
results with the unemployment data set, since the other data 
sets show similar trends. With α = 0.01, results are presented
in Figure 7. In Figure 7(a) we plot the average
relative error with respect to different scales for Q and we 
observe similar trends among three R values. With R fixed, 
we can see that the accuracy of Kalman filter drops as Q 
increases. In Figure 7(b) we plot the average relative error 
with respect to different scales for R using three Q values. 
Given Q fixed, the accuracy improves as R increases. This
can be explained by Equations (14) and (16), which define how the
Kalman gain is calculated. As Q increases, the Kalman gain
increases, so the a posteriori estimate, i.e., the final
released value, favors the noisy observation. As R increases,
the Kalman gain decreases, so the a posteriori estimate
favors the a priori state prediction. Both results
consistently indicate that it is beneficial to rely more on the
state prior than on the noisy measurement when the privacy
budget is small, which means the observed values carry larger
noise.
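The dependence of the Kalman gain on Q and R can be checked numerically. The sketch below iterates the textbook scalar recursion for a constant state model until the gain settles; it assumes that Equations (14) and (16) take this standard form, and the function name, the initial variance `p0`, and the iteration count are illustrative choices.

```python
def steady_state_gain(q: float, r: float, iterations: int = 5000, p0: float = 1.0) -> float:
    """Iterate the scalar Kalman recursion and return the final gain."""
    p = p0
    gain = 0.0
    for _ in range(iterations):
        p_prior = p + q                  # predicted error variance
        gain = p_prior / (p_prior + r)   # Kalman gain
        p = (1.0 - gain) * p_prior       # corrected error variance
    return gain


# Larger Q pushes the gain up (more weight on the noisy observation);
# larger R pushes it down (more weight on the state prior).
for q, r in [(0.001, 10.0), (0.1, 10.0), (0.001, 100.0)]:
    print(q, r, steady_state_gain(q, r))
```

With a gain close to zero the released value is dominated by the prior, which matches the observation that a small Q and a large R work best when the Laplace noise is heavy.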



6.3 Accuracy and Robustness of Adaptive Sampling

As the Kalman filter provides satisfactory estimations with
limited privacy budget, we now examine the performance of
adaptive sampling with respect to fixed-rate sampling, given
that the privacy budget α is 0.01. We vary the fixed sampling
interval from 1 to 10, corresponding to sampling rates from 100%
to 10%. For our adaptive approach, we set M to be 25% × T.

Figure 9: Impact of Control Parameters on Accuracy
with Traffic data set

The results with all three data sets are shown in Figure 8.
As the sampling interval increases, i.e. from 1 to 5 in
Figure 8(a), fixed-rate sampling shows reduced average relative
error, to varying degrees. This phenomenon can be interpreted
by the decrease of total perturbation error resulting from
less frequent queries. As the interval further increases, from
7 to 10 as in Figure 8(a) and from 7 to 10 as in Figure 8(b),
the average relative error starts to rise again. This can be
explained by the accumulation of prediction error, due to
long-term prediction without correction. The optimal sampling
interval is not known a priori and may differ from data set to
data set. As shown in Figure 8, we found that the performance of
adaptive sampling is comparable to that of the optimal fixed-rate
sampling across all data sets, thanks to the adaptive strategy.

6.4 Effects of PID Parameters 

We first examine the effect of M, the maximum allowable
number of queries, with the traffic data set, as shown in
Figure 9(a). As the ratio of M to T increases, the
average relative error grows, which can be explained by the
accumulation of perturbation error. Figure 9(b) studies the
impact of the control gains as opposed to the integration time.
Taking the case C_p = 1, C_i = 0, C_d = 0 as the baseline,
we plot the difference between the performance of every other
setting and that of the baseline. In addition to the
constraints from Equations (22) and (23), we choose the control
gains according to the common practice: proportional >
integral > derivative. As seen in the plot, when the
integration time increases, the resulting relative error shows
no clear trend. Furthermore, between different control gain
settings, the variation of error is small enough that we
believe it is due to the randomness of the Laplace perturbation
errors. Therefore we conclude that there is no extra "rule of
thumb" beyond the common practice for tuning the controller
gains in our application.

We also study the impact of ξ and θ in Equation (24)
on the accuracy of the published data series. With either
one fixed, varying the value of the other does not result in
substantial changes in relative error. We consider
both of them not influential and omit the detailed figures.

6.5 Overall Performance vs. Alternative Approaches

To compare our method with the alternative approaches
in Section 2, we also implemented Algorithm 1 as
well as Algorithm 2. The number of DFT coefficients to
preserve, k, is set to 20, which is near optimal according
to the study of [23]. We plot the average relative error
of our adaptive approach, the Laplace perturbation algorithm,
and the Discrete Fourier Transform (DFT) algorithm with
respect to different privacy budget scales. Results are shown
in Figure 10. Again, the adaptive approach shows superior
performance when the privacy budget α is limited. This
confirms our hypothesis that, with accurate estimates from the
Kalman filter, the PID control mechanism can adjust the
sampling rate as needed, thus improving the overall utility
of the published series. Note that when the privacy budget
is high and approaching 1, the baseline Laplace perturbation
algorithm achieves smaller relative error because of the
reduced perturbation error. Since the reconstruction error of
the DFT approach and the prediction error of our adaptive
approach both outweigh the perturbation error in this case,
their released series contain larger relative error. However,
a privacy budget greater than 1 no longer provides sufficient
privacy protection. We find that our adaptive
approach outperforms the alternative methods under strong
privacy guarantees.

7. RELATED WORK 

Here we give a brief review over existing works related to 
differential privacy, time series, and the Kalman filter. 

Differential privacy on static data. Dwork et al. ^ 
established the guideline to guarantee differential privacy 
for individual aggregate queries by calibrating the Lapla- 
cian noise to the global sensitivity of each query. Since 
then, various mechanisms have been proposed to enhance 
the accuracy of differentially private data release. Blum et 
al. [2] proved the possibility of non-interactive data release
satisfying differential privacy for queries with polynomial 
VC-dimension, such as predicate queries. Dwork et al. 1^ 
further proposed more efficient algorithms to release private 
sanitization of a data set with hardness results obtained. 
The work of Hay et al. [11] improved the accuracy of a tree
of counting queries through a consistency check, which is done
as a post-processing procedure after adding Laplace noise.
This hierarchical structure of queries is referred to as
histograms by several techniques [11, 17, 27], where each level
in the tree is an increasingly fine-grained summary of the
data. The work by Xiao et al. [27] proposed a two-phase
partitioning algorithm using a kd-tree structure to improve
the accuracy of released histograms. Li et al. [17] studied
the feasibility of providing an optimal query strategy by
analyzing a workload of counting queries a priori and estimating
answers from the query strategy. Xiao et al. [26] proposed
another approach based on the Haar wavelet, which transforms
the original data summary before adding Laplace noise to
it. Another recent study [25], aiming to reduce the relative
error, suggests injecting different amounts of Laplace noise
based on the query result and works well with multidimensional
data. Several other works studied differentially private
mechanisms for particular kinds of data, such as search
logs [16] and set-valued data [g]. When applied to highly
self-correlated time-series data, all the above methods,
designed to perturb static data, become problematic because
of highly compounded Laplace perturbation error.

Time series. Time series data is pervasively encountered in
the fields of engineering, science, sociology, and economics.
Various techniques [s], such as ARIMA modeling, exponential
smoothing, ARAR, and Holt-Winters methods, have
been studied for time-series forecasting. Papadimitriou et
al. [22] studied the trade-offs between the time-series
compressibility property and perturbation. They proposed two
algorithms based on the Fast Fourier Transform (FFT) and the
Discrete Wavelet Transform (DWT), respectively, to perturb
time-series frequencies. But the additive noise they proposed
does not guarantee differential privacy, meaning it does not
protect sensitive information from adversaries with strong
background knowledge. Rastogi and Nath [23] proposed a
Discrete Fourier Transform (DFT) algorithm which implements
differential privacy when perturbing time-series data.
However, the DFT algorithm cannot hide data on the fly in
a streaming environment. We are also aware of the recent
work by Dwork et al. [9] on continuous data streams. They
defined event-level privacy to protect an event, i.e. one
user's presence at a particular time point, rather than the
presence of that user. If one user contributes to the
aggregation at time points t - 1, t, and t + 1, event-level privacy
hides the user's presence at only one of the three time points,
leaving the other two open to attack.

Kalman filter. R.E. Kalman published the seminal paper
on the Kalman filter [13] in 1960. Since then, it has become
widely applied to areas of signal processing [4] and assisted
navigation systems [1]. It has also gained popularity in other
areas of engineering. One particular application is to wireless
sensor networks. Jain et al. [12] adopted a dual Kalman
filter model on both server and remote sensors to filter out
as much data as possible to conserve resources. But their
main concern was to minimize memory usage and communication
overhead between sensors and the central server
by storing dynamic procedures instead of static data.
Distributed Kalman filtering [20] by Olfati-Saber was deployed
to a network of sensors to reach a consensus of estimate
among neighboring sensor nodes. The complexity and stability
of such deployment was formally studied in [21].

8. CONCLUSION 

We have proposed an adaptive approach with sampling 
and estimation to release time series under differential pri- 
vacy. The key innovation is that our approach utilizes feed- 
back loops based on observed (perturbed) values to dynam- 
ically adjust the prediction/estimation model and the sam- 
pling rate. To minimize the overall privacy cost, the solution 
uses the PID controller to adaptively sample long time-series 
according to detected data dynamics. To improve the
accuracy of data release per timestamp, the Kalman filter 
is used to predict data values at non-sampling points and 
to estimate true values from perturbed query answers at 
sampling points. Our experiments with three real data sets 
show that it is beneficial to incorporate feedback into both 
the prediction model and the sampling process. The results 
confirmed that our adaptive approach improves accuracy of 
time-series release and has excellent performance even under 
very small privacy cost. 

As for the future, we plan to develop an improved filter to
handle Laplacian measurement noise, as defined by
Equation (7), instead of approximating it with Gaussian noise.
A good candidate is the Masreliez filter proposed in [18],
and we need to overcome the difficulty of implementing the
convolution operation involved in evaluating the score
function of the Masreliez filter. Another potential direction is to
expand our solution to publish differentially private
spatial-temporal statistics, for example, real-time traffic conditions
at all intersections of a city.